What could we do if Latent Profile Analysis (LPA) does not run as expected? – LPA of Chicago Neighborhoods

Tags: Biostatistics, Psychology/Sociology, Factor Analysis, Mclust, Mplus, Tutorial, Mapping

American Community Survey data - Chicago neighborhoods;
Latent Profile Analysis: Gaussian Mixture modeling;
Mclust package;
Mplus;
Exploratory Factor Analysis;

Hai Nguyen
October 24, 2021
library(mclust)     # Gaussian finite mixture models (LPA)
library(tidyverse)  # data wrangling and plotting
library(nFactors)   # scree test and parallel analysis

Story

Chicago is famous for its diverse neighborhoods. That is one thing; but how can we measure each neighborhood area based on its socioeconomic status?

This write-up (with step-by-step code and explanations) walks us through clustering the neighborhoods with a modern method: Latent Profile Analysis. However, the method comes with some barriers (model selection criteria, homogeneity and separation, the normality assumption) that we need another technique to overcome.

Outline:

1- Prepare ACS data
2- What is LPA
3- Barriers of LPA and the steps to solve them
4- Mapping
5- LPA with covariate(s)

Prepare data

The American Community Survey

The American Community Survey (ACS) is a U.S. Census Bureau data product that publishes estimates and their margins of error for all geographies covered by the survey.

The data cover demographic, social, economic, and housing subject areas, released as Detailed Tables for the ACS 1-year or 5-year estimates. Here we deal with the ACS 5-year estimates, which are published for all geographic areas, including census tracts, block groups, American Indian areas, core-based statistical areas, combined statistical areas, Congressional districts, and state legislative districts.

ACS website

ACS data

Here is a summary of the dataset "Chicago Census Data by Tract_American Community Survey (ACS)". It contains census tract characteristics from the 2010-2014 5-year ACS estimates, which are easy to download from https://data.census.gov/cedsci/.
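
As an aside, tract-level ACS tables can also be pulled straight into R. Below is a minimal sketch with the tidycensus package, assuming you have a Census API key; the variable code shown (B19013_001, median household income) is illustrative and is not the exact extract behind this dataset.

library(tidycensus)
# one-time setup (assumption: you have requested a free Census API key)
# census_api_key("YOUR_KEY", install = TRUE)
# B19013_001 = median household income (illustrative variable only)
chi.acs <- get_acs(geography = "tract",
                   variables = c(MedianIncomeHH = "B19013_001"),
                   state = "IL", county = "Cook",
                   year = 2014, survey = "acs5")  # 2010-2014 5-year estimates
head(chi.acs)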

We will use the following \(11\) variables in the exploratory factor analysis:

Variable names and labels:
* AgeDependencyRatio: age dependency ratio
* LimitedEngProf5andUP: limited English proficiency (ages 5 and up)
* LessThanHS: less than a high school education
* Unemployment: unemployment rate
* PctForeignBorn: percentage foreign-born
* FemaleHHPct: percentage of female-headed households
* MedianIncomeHH: median household income
* AllOcc2009Pct: percentage of all occupants in 2009
* PctVacHousing: percentage of vacant housing
* PctBelowPoverty: percentage below poverty
* PublicAssistancePct: percentage receiving public assistance
census <- read.csv("data/Chicago Census Data by Tract_ACS 2015.csv", header = TRUE) %>% 
  select(tract, AgeDependencyRatio, LimitedEngProf5andUP, LessThanHS, Unemployment, 
         PctForeignBorn, FemaleHHPct, MedianIncomeHH, AllOcc2009Pct, PctVacHousing, 
         PctBelowPoverty, PublicAssistancePct)
# recode into % to scale to other variables
df <- census %>% mutate(
  PctForeignBorn = PctForeignBorn*100,
  FemaleHHPct = FemaleHHPct*100,
  # MedianIncomeHH1Kinv = (max(MedianIncomeHH, na.rm=TRUE) - MedianIncomeHH)/1000,
  # we would not use 1Kinv because of weird distribution (multi-mode)
  MedianIncomeHH1K = MedianIncomeHH/1000,
  AllOcc2009Pct = AllOcc2009Pct*100,
  PctVacHousing = PctVacHousing*100,
  PublicAssistancePct = PublicAssistancePct*100
) %>% 
  select(tract, AgeDependencyRatio, LimitedEngProf5andUP, LessThanHS, Unemployment, 
         PctForeignBorn, FemaleHHPct, MedianIncomeHH1K, AllOcc2009Pct, PctVacHousing, 
         PctBelowPoverty, PublicAssistancePct)
df2 <- df.fa <- df[complete.cases(df),]
summary(df2)
     tract        AgeDependencyRatio LimitedEngProf5andUP
 Min.   : 10100   Min.   :  4.8      Min.   : 0.00       
 1st Qu.:160800   1st Qu.: 39.8      1st Qu.: 1.90       
 Median :351100   Median : 55.0      Median : 8.40       
 Mean   :403974   Mean   : 52.9      Mean   :13.99       
 3rd Qu.:670300   3rd Qu.: 65.3      3rd Qu.:24.00       
 Max.   :843900   Max.   :147.4      Max.   :64.00       
   LessThanHS     Unemployment   PctForeignBorn    FemaleHHPct    
 Min.   : 0.00   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 8.20   1st Qu.: 7.00   1st Qu.: 3.858   1st Qu.: 8.704  
 Median :16.30   Median :11.90   Median :15.115   Median :16.502  
 Mean   :18.39   Mean   :14.52   Mean   :18.659   Mean   :20.122  
 3rd Qu.:26.20   3rd Qu.:19.80   3rd Qu.:31.076   3rd Qu.:29.859  
 Max.   :62.50   Max.   :91.90   Max.   :71.263   Max.   :70.470  
 MedianIncomeHH1K AllOcc2009Pct   PctVacHousing    PctBelowPoverty
 Min.   :  5.00   Min.   : 0.00   Min.   : 0.000   Min.   : 0.40  
 1st Qu.: 29.40   1st Qu.:53.45   1st Qu.: 7.752   1st Qu.:12.50  
 Median : 42.32   Median :62.85   Median :11.722   Median :21.80  
 Mean   : 49.66   Mean   :62.85   Mean   :13.898   Mean   :24.07  
 3rd Qu.: 62.06   3rd Qu.:72.80   3rd Qu.:18.066   3rd Qu.:33.80  
 Max.   :160.46   Max.   :94.06   Max.   :50.673   Max.   :74.20  
 PublicAssistancePct
 Min.   : 0.000     
 1st Qu.: 9.252     
 Median :22.002     
 Mean   :24.440     
 3rd Qu.:36.667     
 Max.   :79.851     

Correlation matrix

cortab <- round(Hmisc::rcorr(as.matrix(df2[,-1]))[[1]],2)
rmarkdown::paged_table(as.data.frame(cortab), options = list(rows.print = 11, cols.print = 11))
corrplot::corrplot(cor(df2[,-1]))

As we can see, some variables are strongly associated with one another (Unemployment, FemaleHHPct, PctBelowPoverty, PublicAssistancePct), while others are not.

Introduction to Latent Profile Analysis
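
Before running it, a one-paragraph refresher: latent profile analysis is the continuous-indicator counterpart of latent class analysis. It assumes the observed indicator vector is drawn from a finite mixture of multivariate Gaussian components, each component corresponding to a latent profile. In standard mixture notation (not specific to this post),

\[
f(\mathbf{x}) = \sum_{k=1}^{G} \pi_k \, \phi(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \pi_k > 0, \quad \sum_{k=1}^{G} \pi_k = 1,
\]

where \(\phi\) is the multivariate normal density and the mclust model names (EII, VVV, VVE, ...) encode constraints on the component covariances \(\boldsymbol{\Sigma}_k\).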

What if we run Latent Profile Analysis with all 11 indicators?

X <- as.matrix(df2[,-1])  # drop the tract ID column
mod <- Mclust(X)          # fits all covariance models over 1-9 components by default

We could use both BIC and ICL. Let's look at the BIC first:

summary(mod$BIC)
Best BIC values:
             VVE,9        EVE,8        EVE,7
BIC      -60838.39 -60858.68524 -60929.97097
BIC diff      0.00    -20.29489    -91.58063
plot(mod, what = "BIC", ylim = range(mod$BIC[,-(1:2)], na.rm = TRUE),
       legendArgs = list(x = "bottomleft"))

The picture is much the same if we use the integrated complete-data likelihood (ICL) criterion instead:

ICL <- mclustICL(X)
plot(ICL)

Running Mclust on all 11 indicators, the data-driven selection produces as many as 8 or 9 clusters.

# tabulate class-membership numbers
table(summary(mod)$classification)

  1   2   3   4   5   6   7   8   9 
 68 137  89  38  52  98  55  79 181 
drmod <- MclustDR(mod, lambda = 1)
plot(drmod, what = "contour")

Why does this happen?

# display the means per class
rmarkdown::paged_table(as.data.frame(round(mod$parameters$mean,2)), 
                       options = list(rows.print = 11, cols.print = 9))
par(mar = c(4, 4, .1, .1))
plot(density(df2$Unemployment))
plot(density(df2$PctBelowPoverty))

The density plots reveal that indicators such as Unemployment and PctBelowPoverty are heavily skewed. A Gaussian mixture compensates for such non-normality by adding components, so the selected number of clusters gets inflated. Thus, with the homogeneity and separation concepts in mind:

\(\Rightarrow\) We want to increase homogeneity. For this purpose, we use factor analysis and keep the variables that load on the first factor.

Factor Analysis

Determine Number of Factors to Extract

ev <- eigen(cor(df.fa[,-1])) # get eigenvalues
ap <- parallel(subject=nrow(df.fa[,-1]), var=ncol(df.fa[,-1]), rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS) 
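
As a cross-check (an aside, not part of the original workflow), the psych package wraps a similar parallel analysis in one call; a minimal sketch, assuming psych is installed:

library(psych)
# compares observed eigenvalues with those from random/resampled data
fa.parallel(df.fa[,-1], fm = "ml", fa = "fa")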

Running EFA

When we run a factor analysis, we need to decide on three things:

  1. the number of factors
  2. the method of estimation
  3. the rotation

\(\Rightarrow\) We run Maximum Likelihood factor analysis (item 2), entering the raw data and extracting 3 factors (item 1), with varimax rotation (item 3):

fit.f <- factanal(x = df.fa[,-1], factors = 3, n.obs = 797, rotation="varimax")
print(fit.f, digits=2, cutoff=.3, sort=TRUE)

Call:
factanal(x = df.fa[, -1], factors = 3, n.obs = 797, rotation = "varimax")

Uniquenesses:
  AgeDependencyRatio LimitedEngProf5andUP           LessThanHS 
                0.37                 0.00                 0.19 
        Unemployment       PctForeignBorn          FemaleHHPct 
                0.29                 0.10                 0.18 
    MedianIncomeHH1K        AllOcc2009Pct        PctVacHousing 
                0.29                 0.21                 0.60 
     PctBelowPoverty  PublicAssistancePct 
                0.12                 0.07 

Loadings:
                     Factor1 Factor2 Factor3
AgeDependencyRatio    0.56            0.56  
Unemployment          0.80                  
FemaleHHPct           0.84                  
MedianIncomeHH1K     -0.80                  
PctVacHousing         0.61                  
PctBelowPoverty       0.93                  
PublicAssistancePct   0.96                  
LimitedEngProf5andUP          0.99          
LessThanHS            0.53    0.69          
PctForeignBorn                0.91          
AllOcc2009Pct                         0.89  

               Factor1 Factor2 Factor3
SS loadings       4.82    2.43    1.33
Proportion Var    0.44    0.22    0.12
Cumulative Var    0.44    0.66    0.78

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 222.31 on 25 degrees of freedom.
The p-value is 1.46e-33 

Applying factor scores

We calculate and show the FA scores, i.e., the concentrated disadvantage index scores:

# score = weighted sum of the indicators whose |loading| on factor 1 exceeds 0.5
df.fa$concdisadv <- as.matrix(df.fa[,-1]) %*% (fit.f$loadings[,1]*(abs(fit.f$loadings[,1])>0.5))
# print out the scores
rmarkdown::paged_table(cbind(tract = df.fa[1:20,1], 
                             condisadv_fascore = df.fa[1:20,13], 
                             df.fa[1:20,2:12]), 
                       options = list(rows.print = 20, cols.print = 13))
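
As an aside, factanal can also return model-based factor scores directly through its scores argument ("regression" or "Bartlett"); a minimal sketch on the 11 indicators (columns 2-12, since concdisadv is now the 13th column):

fit.s <- factanal(x = df.fa[,2:12], factors = 3, rotation = "varimax",
                  scores = "regression")
head(fit.s$scores)  # one column per factor; Factor1 tracks concentrated disadvantage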

Latent Profile Analysis

Data to run LPA

The latent profile analysis (normal mixture) will use the variables that load on factor 1, as identified by the factor analysis:

1- AgeDependencyRatio
2- Unemployment
3- FemaleHHPct
4- MedianIncomeHH1K
5- PctVacHousing
6- PctBelowPoverty
7- PublicAssistancePct
8- LessThanHS

# keep the variables loading on factor 1 (plus the tract ID)
df3 <- df2 %>% 
  select(tract, AgeDependencyRatio, Unemployment, FemaleHHPct, 
         MedianIncomeHH1K, PctVacHousing, PctBelowPoverty,  
         PublicAssistancePct, LessThanHS
         )

How many clusters should we create?

X <- as.matrix(df3[,-1])
ICL <- mclustICL(X)
summary(ICL)
Best ICL values:
             VVV,3        VVV,4        VVE,7
ICL      -45689.17 -45706.83796 -45709.47096
ICL diff      0.00    -17.66358    -20.29659
plot(ICL)

modBIC <- Mclust(X)
summary(modBIC$BIC)
Best BIC values:
            VVE,7        VVE,6        VVV,4
BIC      -45583.8 -45597.13834 -45603.11491
BIC diff      0.0    -13.33485    -19.31142
plot(modBIC, what="BIC", ylim=range(modBIC$BIC[,-(1:2)], na.rm=TRUE), 
     legendArgs = list(x = "bottomright"))

drmod <- MclustDR(modBIC, lambda = 1)
plot(drmod, what = "contour")

mod.3 <- Mclust(X, G = 3)
drmod.3 <- MclustDR(mod.3, lambda = 1)
plot(drmod.3, what = "contour")

We fit a 3-component GMM with unconstrained covariance matrices (VVV), the top ICL choice:

mod.3 <- Mclust(df2[,-1], G=3, modelNames = "VVV") 
summary(mod.3) 
---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust VVV (ellipsoidal, varying volume, shape, and orientation)
model with 3 components: 

 log-likelihood   n  df       BIC       ICL
      -29837.77 797 233 -61232.17 -61266.61

Clustering table:
  1   2   3 
189 329 279 

Mixture probabilities and means (SDs in a separate table below) for the census tract characteristics:

sum.mod.3 <- summary(mod.3, parameters = TRUE)
rbind(Component = c("Class 1","Class 2","Class 3"),
      `Mix Probability` = paste0(round((sum.mod.3$pro)*100,1),"%"),
      round(sum.mod.3$mean,1),
      Labels = c("Affluent","Poor","Distressed")
)
                     [,1]       [,2]      [,3]        
Component            "Class 1"  "Class 2" "Class 3"   
Mix Probability      "23.7%"    "41.3%"   "35%"       
AgeDependencyRatio   "35.9"     "50.3"    "67.5"      
LimitedEngProf5andUP "6.6"      "27.3"    "3.3"       
LessThanHS           "5.3"      "24.6"    "19.9"      
Unemployment         "5.3"      "11.3"    "24.6"      
PctForeignBorn       "14"       "33.4"    "4.4"       
FemaleHHPct          "5.8"      "15.5"    "35.3"      
MedianIncomeHH1K     "87.1"     "44.9"    "29.9"      
AllOcc2009Pct        "56"       "64.1"    "66"        
PctVacHousing        "9"        "11"      "20.6"      
PctBelowPoverty      "9.3"      "22.5"    "35.9"      
PublicAssistancePct  "5"        "21"      "41.6"      
Labels               "Affluent" "Poor"    "Distressed"
sd.class1 <- round(sqrt(diag(sum.mod.3$variance[,,1])),1)
sd.class2 <- round(sqrt(diag(sum.mod.3$variance[,,2])),1)
sd.class3 <- round(sqrt(diag(sum.mod.3$variance[,,3])),1)
cbind(sd.class1,sd.class2,sd.class3)
                     sd.class1 sd.class2 sd.class3
AgeDependencyRatio        18.6      14.1      15.9
LimitedEngProf5andUP       5.1      12.6       5.2
LessThanHS                 4.4      14.1       8.3
Unemployment               2.6       4.4       9.8
PctForeignBorn             7.2      11.9       5.8
FemaleHHPct                4.0       6.6      11.3
MedianIncomeHH1K          22.9      12.5      11.0
AllOcc2009Pct             16.0      11.3      12.5
PctVacHousing              6.5       4.9       9.7
PctBelowPoverty            5.1       9.6      14.0
PublicAssistancePct        4.1       9.9      14.0

Applying the clustering to the Chicago neighborhood map

Cluster membership by class number

df2$ConcDisadv_cluster <- mod.3$classification # becomes the 13th column of df2
with(df2, table(ConcDisadv_cluster))
ConcDisadv_cluster
  1   2   3 
189 329 279 

Probability for each tract in each class

df2[,14:16] <- mod.3$z
names(df2)[14:16] <- c("ConcDisadv_prob1", 
                       "ConcDisadv_prob2", 
                       "ConcDisadv_prob3")

Concentrated disadvantage cluster assignment, based on the highest posterior probability:

rmarkdown::paged_table(cbind(tract = df2[1:20,1], 
                             `Concentrated Disadvantage` = df2[1:20,13],
                             df2[1:20,14:16],
                             df2[1:20,2:12]),
                       options = list(rows.print = 20, cols.print = 13))

With these class levels, we can draw the clusters on the Chicago neighborhood map (using ArcGIS), as shown below.
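
For readers without ArcGIS, here is a minimal all-R sketch assuming the tigris, sf, and ggplot2 packages, and assuming that tract corresponds to the numeric 6-digit tract code (TRACTCE) within Cook County; verify both assumptions against your data before relying on the map.

library(tigris)   # TIGER/Line shapefiles (assumed installed)
library(sf)
library(ggplot2)
options(tigris_use_cache = TRUE)
# Cook County census tract boundaries for the ACS vintage used above
cook <- tracts(state = "IL", county = "Cook", year = 2014, class = "sf")
cook$tract <- as.numeric(cook$TRACTCE)  # assumption: matches df2$tract coding
map.df <- merge(cook, df2[, c("tract", "ConcDisadv_cluster")], by = "tract")
ggplot(map.df) +
  geom_sf(aes(fill = factor(ConcDisadv_cluster)), color = NA) +
  scale_fill_discrete(name = "Class",
                      labels = c("1: Affluent", "2: Poor", "3: Distressed")) +
  theme_void()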

Classification by race-ethnicity

r_eth.df <- read.csv("data/Chicago Census Data by Tract_ACS 2015.csv", header = TRUE) %>% 
  select(tract, PctBlack,PctAsian,PctHispanicLatino,PctWhiteNonHisp)
dff <- merge(df2, r_eth.df, by="tract")
mclust1Dplot(dff[,"PctBlack"], parameters = mod.3$parameters, z = mod.3$z, 
             what = "classification", main = FALSE)
title("Percent Black")

mclust1Dplot(dff[,"PctAsian"], parameters = mod.3$parameters, z = mod.3$z, 
             what = "classification", main = FALSE)
title("Percent Asian")

mclust1Dplot(dff[,"PctHispanicLatino"], parameters = mod.3$parameters, z = mod.3$z, 
             what = "classification", main = FALSE)
title("Percent Hispanic Latino")

mclust1Dplot(dff[,"PctWhiteNonHisp"], parameters = mod.3$parameters, z = mod.3$z, 
             what = "classification", main = FALSE)
title("Percent White")

We can see that neighborhoods with a high percent Black population concentrate in class 3 (the first graph). The opposite holds for neighborhoods with a high percent non-Hispanic White population, which concentrate in class 1 (the last graph).

Now, how can we turn the visualization into an analysis? I will discuss this in a future post.
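
In the meantime, one natural option (my own sketch, not necessarily the author's forthcoming analysis) is a multinomial logit of the modal class on tract composition, e.g. with nnet::multinom; note that this "classify then analyze" shortcut ignores the classification uncertainty:

library(nnet)  # multinom() for multinomial logistic regression
dff$class <- factor(dff$ConcDisadv_cluster,
                    labels = c("Affluent", "Poor", "Distressed"))
# PctWhiteNonHisp left out as the implicit reference composition
fit.mn <- multinom(class ~ PctBlack + PctAsian + PctHispanicLatino, data = dff)
summary(fit.mn)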

Further readings

Pugach, Oksana. 2018. "Latent Profile Analysis of Chicago Neighborhoods." Unpublished work. Chalk Talk, Institute for Health Research and Policy - Methodology Research Core.

University of Virginia Library, Research Data Services + Sciences. “Getting Started with Factor Analysis” - https://data.library.virginia.edu/getting-started-with-factor-analysis/

Stats Stackexchange. “How to calculate the loading matrix from the score matrix and a data matrix X (PCA)?” - https://stats.stackexchange.com/questions/447952/how-to-calculate-the-loading-matrix-from-the-score-matrix-and-a-data-matrix-x-p

Luca Scrucca. 2020. “A quick tour of mclust” - https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html

Luca Scrucca, Michael Fop, T. Brendan Murphy, Adrian E. Raftery. 2016. "mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models." The R Journal 8(1): 289-317.


Citation

For attribution, please cite this work as

Nguyen (2021, Oct. 24). HaiBiostat: What could we do if Latent Profile Analysis (LPA) does not run as expected? -- LPA of Chicago Neighborhoods. Retrieved from https://hai-mn.github.io/posts/2021-10-24-LPA Chicago Neighborhoods/

BibTeX citation

@misc{nguyen2021what,
  author = {Nguyen, Hai},
title = {HaiBiostat: What could we do if Latent Profile Analysis (LPA) does not run as expected? -- LPA of Chicago Neighborhoods},
  url = {https://hai-mn.github.io/posts/2021-10-24-LPA Chicago Neighborhoods/},
  year = {2021}
}